Off-policy and on-policy reinforcement learning with the Tsetlin machine
Authors
Abstract
The Tsetlin Machine is a recent supervised learning algorithm that has obtained competitive accuracy and resource-usage results across several benchmarks. It has been used for convolution, classification, and regression, producing interpretable rules in propositional logic. In this paper, we introduce the first framework for reinforcement learning based on the Tsetlin Machine. Our framework integrates value iteration with the regression Tsetlin Machine as the value function approximator. To obtain accurate off-policy state-value estimation, we propose a modified feedback mechanism that adapts to the dynamic nature of value iteration. In particular, we show that the Tsetlin Machine is able to unlearn and recover from the misleading experiences that often occur at the beginning of training. A key challenge we address is mapping the intrinsically continuous nature of state-value learning to the propositional Tsetlin Machine architecture, leveraging probabilistic updates. While accurate off-policy, this mechanism learns significantly slower than neural networks on-policy. However, by introducing multi-step temporal-difference learning in combination with high-frequency propositional logic patterns, we are able to close the performance gap. Several gridworld instances document that our framework can outperform comparable neural network models, despite being based on simple one-level AND-rules. Finally, we show how the class of models learnt by our approach for the gridworld problem can be translated into a more understandable graph structure. The graph structure captures the state-value approximation and the corresponding policy found.
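To make the pipeline the abstract describes more concrete, the following is a minimal sketch of multi-step (n-step) temporal-difference value learning on a toy gridworld with a pluggable regressor standing in for the regression Tsetlin Machine. The gridworld layout, the random behaviour policy, the LinearValueRegressor stand-in, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code): n-step TD state-value
# learning on a toy gridworld, with a simple regressor in place of the
# regression Tsetlin Machine used by the authors.
import numpy as np

GRID = 4                     # 4x4 gridworld, goal in the bottom-right corner
GAMMA, ALPHA, N_STEP = 0.95, 0.1, 3

def step(state, action):
    """Move within the grid; reaching the goal gives reward 1 and terminates."""
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    r, c = divmod(state, GRID)
    dr, dc = moves[action]
    r, c = min(max(r + dr, 0), GRID - 1), min(max(c + dc, 0), GRID - 1)
    nxt = r * GRID + c
    done = nxt == GRID * GRID - 1
    return nxt, (1.0 if done else 0.0), done

class LinearValueRegressor:
    """Stand-in function approximator: values over one-hot state features."""
    def __init__(self, n_states):
        self.w = np.zeros(n_states)
    def predict(self, state):
        return self.w[state]
    def update(self, state, target):
        self.w[state] += ALPHA * (target - self.w[state])

V = LinearValueRegressor(GRID * GRID)
rng = np.random.default_rng(0)

for episode in range(500):
    state, trajectory, done = 0, [], False
    while not done:
        action = rng.integers(4)                    # random behaviour policy
        nxt, reward, done = step(state, action)
        trajectory.append((state, reward))
        state = nxt
    # n-step TD targets, bootstrapping from the current value estimate
    for t in range(len(trajectory)):
        horizon = min(N_STEP, len(trajectory) - t)
        target = sum((GAMMA ** k) * trajectory[t + k][1] for k in range(horizon))
        if t + horizon < len(trajectory):           # bootstrap if not terminal
            target += (GAMMA ** horizon) * V.predict(trajectory[t + horizon][0])
        V.update(trajectory[t][0], target)

print(np.round(V.w.reshape(GRID, GRID), 2))
```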
Similar resources
On-Policy vs. Off-Policy Updates for Deep Reinforcement Learning
Temporal-difference-based deep reinforcement learning methods have typically been driven by off-policy, bootstrapped Q-learning updates. In this paper, we investigate the effects of using on-policy, Monte Carlo updates. Our empirical results show that for the DDPG algorithm in a continuous action space, mixing on-policy and off-policy update targets exhibits superior performance and stability comp...
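As a rough illustration of the idea in this snippet, the sketch below blends an on-policy Monte Carlo return with an off-policy bootstrapped target for a DDPG-style critic; the mixing coefficient `beta` and the function name are assumptions for illustration, not that paper's notation.

```python
# Hedged sketch: mixing an on-policy Monte Carlo return with an off-policy
# bootstrapped TD(0) target. `beta` is a hypothetical mixing coefficient.
def mixed_critic_target(mc_return, reward, next_q, gamma=0.99, beta=0.5):
    bootstrap_target = reward + gamma * next_q   # off-policy one-step target
    return beta * mc_return + (1.0 - beta) * bootstrap_target
```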
Safe and Efficient Off-Policy Reinforcement Learning
In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of “off-policyness”; and (3) it is efficient as it makes the b...
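For reference, a small sketch of the truncated-importance-weight correction that Retrace(λ) is built around; the array inputs and helper name are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of the Retrace(lambda) correction for one trajectory:
# truncated importance weights c_s = lambda * min(1, pi/mu) keep variance low
# while staying safe for any behaviour policy.
import numpy as np

def retrace_correction(q, expected_next_q, rewards, pi_probs, mu_probs,
                       gamma=0.99, lam=1.0):
    """Return sum_t gamma^t (c_1 ... c_t) * delta_t, where
    delta_t = r_t + gamma * E_pi[Q(x_{t+1}, .)] - Q(x_t, a_t)."""
    c = lam * np.minimum(1.0, np.asarray(pi_probs) / np.asarray(mu_probs))
    delta = (np.asarray(rewards) + gamma * np.asarray(expected_next_q)
             - np.asarray(q))
    correction, trace = 0.0, 1.0
    for t in range(len(delta)):
        if t > 0:
            trace *= c[t]            # product c_1 ... c_t (empty product = 1)
        correction += (gamma ** t) * trace * delta[t]
    return correction
```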
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have ...
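One standard baseline for this kind of off-policy evaluation is ordinary per-trajectory importance sampling; the sketch below is that generic textbook baseline under assumed inputs, not the specific estimator proposed in that paper.

```python
# Hedged sketch: ordinary importance-sampling (IS) evaluation of a target
# policy pi_e from trajectories collected under a behaviour policy pi_b.
def is_policy_value(trajectories, pi_e, pi_b, gamma=0.99):
    """trajectories: list of [(state, action, reward), ...] under pi_b.
    pi_e(a, s) / pi_b(a, s): action probabilities of target / behaviour policy."""
    estimates = []
    for trajectory in trajectories:
        weight, ret = 1.0, 0.0
        for t, (state, action, reward) in enumerate(trajectory):
            weight *= pi_e(action, state) / pi_b(action, state)
            ret += (gamma ** t) * reward
        estimates.append(weight * ret)
    return sum(estimates) / len(estimates)
```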
Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning
Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning. Theoretical resu...
Off-Policy Shaping Ensembles in Reinforcement Learning
Recent advances in gradient temporal-difference methods allow learning multiple value functions off-policy, in parallel, without sacrificing convergence guarantees or computational efficiency. This opens up new possibilities for sound ensemble techniques in reinforcement learning. In this work we propose learning an ensemble of policies related through potential-based shaping rewards. The ensembl...
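The shaping rewards referred to in this snippet are potential-based; a minimal sketch follows, assuming a hypothetical user-supplied potential function `phi` (not defined in that paper).

```python
# Hedged sketch: potential-based shaping reward F(s, s') = gamma*phi(s') - phi(s),
# the standard form that leaves the optimal policy unchanged. `phi` is a
# hypothetical potential function supplied by the user.
def shaped_reward(reward, state, next_state, phi, gamma=0.99):
    return reward + gamma * phi(next_state) - phi(state)
```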
Journal
Journal: Applied Intelligence
Year: 2023
ISSN: 0924-669X, 1573-7497
DOI: https://doi.org/10.1007/s10489-022-04297-3